The dataset used to develop this project concerns chronic kidney disease and is available at the UC Irvine Machine Learning Repository (Rubini and Eswaran 2015). It contains medical data and laboratory parameters from \(400\) patients, distributed across \(24\) variables including blood pressure, potassium and appetite, among others.
The information contained in the dataset can be classified by data type:
Quantitative.
age; Age of the patient (in years).
bp; Blood pressure (in mm/Hg).
bgr; Blood glucose random (in mgs/dl).
bu; Blood urea (in mgs/dl).
sc; Serum creatinine (in mgs/dl).
sod; Sodium (mEq/L).
pot; Potassium (mEq/L).
hemo; Hemoglobin (gms).
pcv; Packed cell volume.
wbcc; White blood cell count (cells/cumm).
rbcc; Red blood cell count (millions/cmm).
Qualitative, Categorical. Variables with a natural order.
sg; Specific gravity (possible levels: 1.005, 1.010, 1.015, 1.020, 1.025).
al; Level of albumin (possible levels: 0, 1, 2, 3, 4, 5).
su; Level of sugar (possible levels: 0, 1, 2, 3, 4, 5).
Qualitative, Binary.
rbc; Whether the red blood cells are normal or abnormal.
pc; Whether the pus cells are normal or abnormal.
pcc; Presence of pus cell clumps.
ba; Whether bacteria are present or not.
htn; Presence of hypertension.
dm; Presence of diabetes mellitus.
cad; Presence of coronary artery disease.
appet; Whether the appetite of the patient is good
or poor.
pe; Presence of pedal edema.
ane; Presence of anemia.
Target.
class; Indicates the presence (ckd) or absence (notckd) of chronic kidney disease.
By analyzing the dataset, the idea is to identify the key factors that allow us to detect the presence of chronic kidney disease. This prediction task is a binary classification problem where the goal is to forecast one of two possible outcomes: ckd (presence of the disease) or notckd (absence of the disease).
Such insights enable the development of strategies to prevent the progression of the disease and improve the quality of life of the affected patients.
This section focuses on preparing the dataset for exploratory data analysis (EDA) and for applying the computational models. This includes loading the necessary libraries, handling missing values, encoding categorical variables, and splitting the dataset into training and testing subsets, among other tasks.
Load libraries. Several R libraries are needed for the project to work properly.
library(skimr) # Summary statistics
library(caret) # Models
library(rattle) # Display DT diagrams
library(mice) # Imputation missing data
library(VIM) # Plots for imputation performance
library(GGally) # Plots of the continuous variables
library(psych) # Compute the skewness and correlation matrix
library(polycor) # Compute correlation heatmap mix data type
library(corrplot) # Correlation matrix
library(pROC) # ROC curves and AUC
library(tidyverse) # Include ggplot2, dplyr, among others
library(plotly) # Interactive plot
library(reshape2) # Data reshaping (melt)
library(gridExtra) # For grid.arrange
library(pander) # Display simple tables
Load dataset. The dataset is stored in .arff format. To open the file, I made some arrangements in a Python environment; the code used to generate the .csv file that I work with during the rest of the project is attached below.
# Require libraries ------------------------------------------------------------
from scipy import io
import pandas as pd
# Load .arff file: extract the df and metadata ---------------------------------
dataframe, meta = io.arff.loadarff('data/chronic_kidney_disease_full.arff')
# Transform into pd df for handling
dataframe = pd.DataFrame(dataframe)  # Some variables are in b'value' (bad format)
# Fixing bad format ------------------------------------------------------------
def bytes_to_int(byte_value):
    try:
        return byte_value.decode('utf-8')  # Decode bytes to a string
    except AttributeError:
        return byte_value  # When it is already a string
# Apply the conversion function to all columns in the df
for column in dataframe.columns:
    dataframe[column] = dataframe[column].apply(bytes_to_int)
# Save the df to .csv file -----------------------------------------------------
dataframe.to_csv("data/chronic_kidney_disease_full.csv", index=False)
# Load new .csv dataset
data <- read.csv("data/chronic_kidney_disease_full.csv", header = TRUE, sep = ",")
First look at the dataset. A brief overview of the dataset before the preprocessing steps.
glimpse(data)
## Rows: 400
## Columns: 25
## $ age <dbl> 48, 7, 62, 48, 51, 60, 68, 24, 52, 53, 50, 63, 68, 68, 68, 40, 4…
## $ bp <dbl> 80, 50, 80, 70, 80, 90, 70, NA, 100, 90, 60, 70, 70, 70, 80, 80,…
## $ sg <chr> "1.020", "1.020", "1.010", "1.005", "1.010", "1.015", "1.010", "…
## $ al <chr> "1", "4", "2", "4", "2", "3", "0", "2", "3", "2", "2", "3", "3",…
## $ su <chr> "0", "0", "3", "0", "0", "0", "0", "4", "0", "0", "4", "0", "1",…
## $ rbc <chr> "?", "?", "normal", "normal", "normal", "?", "?", "normal", "nor…
## $ pc <chr> "normal", "normal", "normal", "abnormal", "normal", "?", "normal…
## $ pcc <chr> "notpresent", "notpresent", "notpresent", "present", "notpresent…
## $ ba <chr> "notpresent", "notpresent", "notpresent", "notpresent", "notpres…
## $ bgr <dbl> 121, NA, 423, 117, 106, 74, 100, 410, 138, 70, 490, 380, 208, 98…
## $ bu <dbl> 36, 18, 53, 56, 26, 25, 54, 31, 60, 107, 55, 60, 72, 86, 90, 162…
## $ sc <dbl> 1.2, 0.8, 1.8, 3.8, 1.4, 1.1, 24.0, 1.1, 1.9, 7.2, 4.0, 2.7, 2.1…
## $ sod <dbl> NA, NA, NA, 111.0, NA, 142.0, 104.0, NA, NA, 114.0, NA, 131.0, 1…
## $ pot <dbl> NA, NA, NA, 2.5, NA, 3.2, 4.0, NA, NA, 3.7, NA, 4.2, 5.8, 3.4, 6…
## $ hemo <dbl> 15.4, 11.3, 9.6, 11.2, 11.6, 12.2, 12.4, 12.4, 10.8, 9.5, 9.4, 1…
## $ pcv <dbl> 44, 38, 31, 32, 35, 39, 36, 44, 33, 29, 28, 32, 28, NA, 16, 24, …
## $ wbcc <dbl> 7800, 6000, 7500, 6700, 7300, 7800, NA, 6900, 9600, 12100, NA, 4…
## $ rbcc <dbl> 5.2, NA, NA, 3.9, 4.6, 4.4, NA, 5.0, 4.0, 3.7, NA, 3.8, 3.4, NA,…
## $ htn <chr> "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes", "yes",…
## $ dm <chr> "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes", "yes"…
## $ cad <chr> "no", "no", "no", "no", "no", "no", "no", "no", "no", "no", "no"…
## $ appet <chr> "good", "good", "poor", "poor", "good", "good", "good", "good", …
## $ pe <chr> "no", "no", "no", "yes", "no", "yes", "no", "yes", "no", "no", "…
## $ ane <chr> "no", "no", "yes", "yes", "no", "no", "no", "no", "yes", "yes", …
## $ class <chr> "ckd", "ckd", "ckd", "ckd", "ckd", "ckd", "ckd", "ckd", "ckd", "…
Ensure dataset consistency.
The dataset contains missing values represented both as “?” and as “NA”. To ensure consistency across the data, all missing values are converted to the same format, making future handling easier.
# Convert "?" to NA
data <- data.frame(lapply(data, function(x) ifelse(x == "?", NA, x)))
Duplicate data.
There is no duplicated data; each row represents a different patient.
duplicates <- data[duplicated(data), ]
Encoding qualitative variables.
The goal is to obtain a dataset that our machine learning models know how to manage. This includes turning the categorical variables into factors. While binary encoding is straightforward, the strategy for the multi-state variables is slightly different due to the presence of a natural order. These variables, sg (specific gravity), al (albumin) and su (sugar), represent different ordered levels of severity or concentration that have an impact on medical conditions such as chronic kidney disease. To perform the proper encoding, it is mandatory to specify the order when creating these levels with the factor function.
# Binary variables -------------------------------------------------------------
# (Including the target variable)
bin_and_target_vars <- c("rbc", "pc", "pcc", "ba", "htn", "dm", "cad", "appet", "pe", "ane", "class")
for (var in bin_and_target_vars) {
data[[var]] <- as.factor(data[[var]])
}
# Categorical variables --------------------------------------------------------
data$sg <- factor(data$sg,
levels = c("1.005", "1.010", "1.015", "1.020", "1.025"),
ordered = TRUE)
data$al <- factor(data$al,
levels = c("0", "1", "2", "3", "4", "5"),
ordered = TRUE)
data$su <- factor(data$su,
levels = c("0", "1", "2", "3", "4", "5"),
ordered = TRUE)
# Checkings --------------------------------------------------------------------
str(data) # To check all qualitative vars are in factor type
levels(data$sg) # Check the levels of a specific variable
table(data$al) # Check how many observation per level there are
Drop categories.
When analysing the al and su parameters, it can be observed that level \(5\) in both variables contains very few observations (one for al and three for su). This can introduce instability as well as future problems in missing value imputation. To address this, observations in these levels are merged into level \(4\), under the assumption that the distinction between these levels is not really significant.
table(data$sg)
##
## 1.005 1.010 1.015 1.020 1.025
## 7 84 75 106 81
table(data$al)
##
## 0 1 2 3 4 5
## 199 44 43 43 24 1
table(data$su)
##
## 0 1 2 3 4 5
## 290 13 18 14 13 3
# Reduce categories on al and su -----------------------------------------------
# Transforming into numeric for easy replacement
data$al <- as.numeric(as.character(data$al))
data$su <- as.numeric(as.character(data$su))
# Change level: 5 -> 4
data$al[data$al == 5] <- 4
data$su[data$su == 5] <- 4
# Turn into factors (order) again without level 5
data$al <- factor(data$al,
levels = c("0", "1", "2", "3", "4"),
ordered = TRUE)
data$su <- factor(data$su,
levels = c("0", "1", "2", "3", "4"),
ordered = TRUE)
# Checkings --------------------------------------------------------------------
table(data$al)
table(data$su)
The next step focuses on identifying missing values in the dataset, which can be due to various factors such as typing errors or unreported patient information.
This dataset contains some missing values, and they must be handled in order to use all the machine learning models proposed during the lectures. The main reason is that not all models support NA's. The table below shows the missing values by column, expressed as percentages. Because none of these variables exceeds a threshold of \(60\)% of missing values, it is recommended not to delete any variable, thereby avoiding losing this information.
total_na <- sum(is.na(data)) # 1012 NAs
# Counting and sorting NA's
missing_values <- colSums(is.na(data))
sort_missing_values <- sort(missing_values, decreasing = TRUE)
# Counting and sorting NA's, display results in %
percentage_na <- missing_values/nrow(data)*100
sort(percentage_na, decreasing = TRUE)
## rbc rbcc wbcc pot sod pcv pc hemo su sg al bgr bu
## 38.00 32.75 26.50 22.00 21.75 17.75 16.25 13.00 12.25 11.75 11.50 11.00 4.75
## sc bp age pcc ba htn dm cad appet pe ane class
## 4.25 3.00 2.25 1.00 1.00 0.50 0.50 0.50 0.25 0.25 0.25 0.00
The first approach is to check whether any patient record has been poorly reported. \(7\) observations were found in which \(11\) columns contain invalid data. These records are removed because approximately \(45.83\)% of the patient's information is missing; keeping such rows would require data fabrication and would directly weaken the future analysis.
# Identifying observations with at least 1 NA ----------------------------------
obvs_with_na <- data[rowSums(is.na(data)) > 0, ]
missing_counts <- rowSums(is.na(obvs_with_na))
# df for storing number of NAs and corresponding indexes
missing_info <- data.frame(
count = names(table(missing_counts)),
indexes = tapply(names(missing_counts), missing_counts, paste, collapse = ", ")
)
pander(missing_info)
| count | indexes |
|---|---|
| 1 | 13, 16, 19, 25, 26, 27, 32, 33, 40, 43, 44, 47, 57, 63, 71, 78, 81, 95, 98, 102, 106, 108, 109, 111, 112, 127, 138, 146, 155, 164, 171, 174, 179, 185, 192, 194, 200, 201, 218, 224, 242, 244, 245, 294, 333 |
| 2 | 5, 6, 9, 36, 37, 41, 45, 50, 64, 74, 88, 100, 103, 104, 142, 159, 173, 205, 240, 275, 288, 301, 303, 304, 310, 335, 337, 350, 351, 364, 366, 379, 382 |
| 3 | 1, 3, 7, 8, 20, 52, 55, 70, 80, 118, 125, 136, 151, 153, 156, 165, 168, 176, 183, 196, 207, 208, 215, 220, 228, 231, 235, 241, 277, 289, 295, 298, 313, 317, 329, 331, 347 |
| 4 | 17, 34, 39, 46, 53, 54, 62, 67, 84, 101, 116, 121, 130, 135, 140, 144, 147, 150, 169, 178, 181, 186, 219, 234, 236, 238, 248, 274, 291, 320, 325 |
| 5 | 2, 11, 30, 38, 48, 51, 61, 69, 82, 89, 90, 96, 97, 99, 115, 124, 132, 133, 141, 161, 175, 180, 184, 187, 188, 222, 225, 246, 281, 284, 296, 316, 323 |
| 6 | 66, 76, 107, 113, 119, 122, 137, 163, 170, 217, 221, 269 |
| 7 | 22, 35, 56, 73, 79, 117, 120, 139, 152, 157, 162, 193, 198, 202, 203, 209, 212, 232, 237, 239 |
| 8 | 14, 18, 42, 58, 65, 68, 189, 210 |
| 9 | 29, 83, 86, 114, 123, 126, 143, 167, 195, 204, 206, 233 |
| 10 | 24, 31, 110, 216 |
| 11 | 60, 87, 105, 149, 166, 223, 229 |
# Delete patients with 11 NA variables -----------------------------------------
rows_to_delete <- c(60, 87, 105, 149, 166, 223, 229)
data <- data[!rownames(data) %in% rows_to_delete, ]
This step consists of dividing the dataset into two subsets. The training set takes \(80\)% of the data and is used to train the models; this subset allows the models to learn patterns and relationships within the data. On the other hand, the testing set (\(20\)% of the data) is used as unseen data to evaluate the performance of the trained models.
set.seed(1234) # For reproducibility
# Split into training (80%) and testing (20%) set
index <- createDataPartition(data$class, p=0.8, list=FALSE)
training <- data[ index,]
testing <- data[-index,]
nrow(training) # Number of observation for training
nrow(testing) # Number of observation for testing
In this section, we cover the imputation of the missing data in the dataset using the MICE technique. This method is provided by the R library mice (van Buuren and Groothuis-Oudshoorn 2011) and, in this script, we apply a slightly different version that allows us to deal with the training and testing sets separately, avoiding data leakage.
Some important considerations in the configuration of this process are the choice of the 'rf' (random forest) method across all variables, in order to handle possible non-linear relationships. Moreover, we generate \(4\) imputations for the training set; however, for this project, we will only consider the first imputation. After generating the imputed data for the training set, we can complete the missing values using the complete() function, which fills in missing values with those from the first imputation. Finally, and this is where we modify the classic version, we apply the mice.reuse() function (Prockenschub 2024), which lets us impute the missing values in the testing set while avoiding data leakage. This function uses the model developed from the training imputation and applies it to the testing set, thus ensuring greater reliability in the imputation of values through the use of learned patterns.
# MICE Imputation --------------------------------------------------------------
meth <- rep("rf", length = ncol(training)) # Set all methods to RF model
# Generates 4 imputations
mice_model <- mice(training,
m = 4, maxit = 5, method = meth,
seed = 1234, print = FALSE
)
# Complete the TRAIN data with the first imputation
training_imp <- complete(mice_model, 1)
# Function to impute new observations based on the previous imputation model
source("https://raw.githubusercontent.com/prockenschaub/Misc/master/R/mice.reuse/mice.reuse.R")
# Apply the imputation model to the TESTING set and extract the first imputation
testing_imp <- mice.reuse(mice_model, testing, maxit = 1)[[1]]
Performance diagnostic of how the imputation is working.
To check that there is no missing data anymore, we use a graph from library(VIM) (Kowarik and Templ 2016), which displays the percentage of NA's.
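A minimal sketch of this check with VIM's aggr() function, comparing the training set before and after imputation (the exact plotting options used in the report are assumptions):
# Missing-data pattern before and after imputation (sketch)
aggr(training, numbers = TRUE, sortVars = TRUE, labels = names(training),
     cex.axis = 0.7, ylab = c("Proportion of missings", "Pattern"))
aggr(training_imp, numbers = TRUE, sortVars = TRUE, labels = names(training_imp),
     cex.axis = 0.7, ylab = c("Proportion of missings", "Pattern"))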
Also, to assess whether the imputed data adequately reflects the distributions of the original data, we can use the densityplot() function. This function shows the marginal distribution of the observed data in blue and, in red, the four densities for the predictors that initially had missing data. As can be observed, the distributions of all variables follow the original distributions closely. Hence, we can trust the quality of the MICE model for handling NA's.
Finally, we can explore in more depth the distribution of the imputed values across the four imputations, compared to the observed data (represented by imputation number \(0\)); see the sketch below. The red points are the imputed values, which are consistently clustered across all imputations, suggesting again that the MICE algorithm performs reliably. Furthermore, this plot demonstrates how well the first imputation (the one we have chosen) is working.
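A minimal sketch of both diagnostic plots, using the mids object mice_model from the imputation chunk; stripplot() is an assumption for the imputation-number plot described above:
# Observed (blue) vs imputed (red) marginal densities
densityplot(mice_model)
# Imputed values per imputation number (0 = observed data)
stripplot(mice_model, pch = 20, cex = 1.2)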
The aim is to uncover patterns, detect outliers, and test assumptions through visual and quantitative methods.
The ggpairs function from the GGally package (Schloerke et al. 2021) allows us to create a scatter plot matrix for these variables. This matrix includes histograms for each variable along the diagonal, showcasing their distributions. The lower triangle displays scatter plots to explore the relationships between variable pairs, while the upper triangle shows the correlation coefficients, providing a quantitative measure of their relationships. This information allows us to identify the shape of the distributions, tendencies and outliers, among other things, thanks to the different views in one grid.
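A minimal sketch of this matrix, assuming it is restricted to the continuous variables of the imputed training set:
# Scatter plot matrix of the continuous variables (sketch)
num_vars <- c("age", "bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wbcc", "rbcc")
ggpairs(training_imp[, num_vars],
        diag = list(continuous = "barDiag"), # Histograms on the diagonal
        lower = list(continuous = wrap("points", alpha = 0.4, size = 0.5)),
        upper = list(continuous = wrap("cor", size = 3)))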
General insights.
Skewness is computed with the psych (William Revelle 2023) package. Log-transformations are applied to variables with positive skewness (bp, bgr, bu, sc, pot and wbcc), while a square transformation is used for sod, which has negative skewness. Notice that the threshold for decision-making is set at skewness values above \(1\) or below \(-1\).

| variable | skewness |
|---|---|
| bp | 1.8 |
| bgr | 2 |
| bu | 2.6 |
| sc | 8.4 |
| sod | -7.1 |
| pot | 11.5 |
| hemo | -0.3 |
| pcv | -0.4 |
| wbcc | 1.4 |
| rbcc | 0 |
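A minimal sketch of how this table could have been computed with psych's skew() function, assuming it was run on the imputed training set:
# Skewness of the continuous predictors (sketch)
cont_vars <- c("bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wbcc", "rbcc")
skew_vals <- sapply(training_imp[, cont_vars], psych::skew)
pander(data.frame(variable = cont_vars, skewness = round(skew_vals, 1)))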
Standardization of Data. The data will be standardized to benefit models that are sensitive to the scale of the data, such as k-Nearest Neighbors and Neural Networks. This step will not negatively affect the other models, such as Gradient Boosting, which are less sensitive to scaling.
About correlation coefficients. It is worth mentioning that this dataset contains high correlations between some variables, identified when coefficients are greater than \(0.7\) or lower than \(-0.7\). For instance, packed cell volume (pcv) and hemoglobin (hemo) have a strong positive correlation coefficient of \(0.869\). From a biological point of view this makes sense: both parameters are related to the amount of oxygen carried through human blood. Hemoglobin is the protein in red blood cells that carries oxygen, while the packed cell volume represents the volume percentage of red blood cells in blood. Moreover, red blood cell count (rbcc) is also highly correlated with hemo (\(0.770\)) and pcv (\(0.766\)), indicating that if one of these parameters increases the other ones tend to increase too.
Outliers. The presence of some outliers is notable across the variables. These observations represent individuals with extreme values, probably due to underlying medical conditions or measurement errors. Notice that these observations may directly influence the model performance and the conclusions derived from the data analysis.
Relation with the output. A deeper exploration of the variables most correlated with the target, class, reveals some important insights for distinguishing chronic kidney disease patients from healthy ones. The first pattern obtained is that people with chronic kidney disease (class = ckd) tend to have lower values of hemoglobin (hemo), packed cell volume (pcv), and red blood cell count (rbcc), as indicated by the red density distributions. In contrast, healthy patients (class = notckd) have higher values for these parameters, as shown by the blue density distributions. Moreover, the boxplots demonstrate the same insight, with median values of hemo, pcv, and rbcc being lower in the chronic kidney disease group. The last distinction observed in the boxplots is that outliers exist only in the distribution of unhealthy people. This situation might highlight the complexity of diagnosing and identifying chronic kidney disease due to the unexpected behavior of these patients.
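A minimal sketch of these plots for hemo (pcv and rbcc follow the same pattern); the color mapping is an assumption based on the description above:
# Density and boxplot of hemoglobin by target class (sketch)
p1 <- ggplot(training_imp, aes(x = hemo, fill = class)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c(ckd = "red", notckd = "blue"))
p2 <- ggplot(training_imp, aes(x = class, y = hemo, fill = class)) +
  geom_boxplot() +
  scale_fill_manual(values = c(ckd = "red", notckd = "blue"))
grid.arrange(p1, p2, ncol = 2)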
Data transformation code.
# Numeric transformation
vars_log_transf <- c('bp', 'bgr', 'bu', 'sc', 'pot', 'wbcc') # Note: No zeros on any var
########################## TRAINING dataset ##########################
training_transf <- training_imp
# Standardize the variables ----------------------------------------------------
# Log-transformation to solve right skewness
training_transf <- training_transf %>%
mutate_at(vars(all_of(vars_log_transf)), ~ log(.))
# ^2-transformation to solve negative skewness
training_transf[,c("sod")] <- (training_transf[,c("sod")])^2
# Scale the variables ----------------------------------------------------------
training_transf[c("bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wbcc", "rbcc")] <- scale(training_transf[c("bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wbcc", "rbcc")])
########################## TESTING dataset ##########################
testing_transf <- testing_imp
# Standardize the variables ----------------------------------------------------
# Log-transformation to solve right skewness
testing_transf <- testing_transf %>%
mutate_at(vars(all_of(vars_log_transf)), ~ log(.))
# ^2-transformation to solve negative skewness
testing_transf[,c("sod")] <- (testing_transf[,c("sod")])^2
# Scale the variables ----------------------------------------------------------
testing_transf[c("bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wbcc", "rbcc")] <- scale(testing_transf[c("bp", "bgr", "bu", "sc", "sod", "pot", "hemo", "pcv", "wbcc", "rbcc")])
Binary variables.
Some tables are displayed in order to better understand the dataset. The first table indicates that most of the patients have normal levels of red blood cells (rbc) and pus cells (pc). In the second table, the majority of the data concerns people without pus cell clumps (pcc) and bacteria (ba). The third table focuses on various conditions, including hypertension (htn), diabetes mellitus (dm), coronary artery disease (cad), pedal edema (pe), and anemia (ane). Hypertension and diabetes mellitus appear to be the most common conditions among the patients, while the other parameters are present in no more than \(19\)% of the people tested, so these conditions might not be common within the population. Finally, most people in the dataset report having a good appetite. In summary, the binary variables are not equally distributed in this dataset. A sketch of how these tables could be built is given after them.
| Category | rbc | pc |
|---|---|---|
| abnormal | 59 | 66 |
| normal | 256 | 249 |
| Category | pcc | ba |
|---|---|---|
| notpresent | 282 | 301 |
| present | 33 | 14 |
| Category | htn | dm | cad | pe | ane |
|---|---|---|---|---|---|
| no | 198 | 208 | 290 | 256 | 266 |
| yes | 117 | 107 | 25 | 59 | 49 |
| Category | appet |
|---|---|
| good | 252 |
| poor | 63 |
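A sketch of how these frequency tables could have been generated (the original chunk is not shown in the report):
# Frequency tables for the binary predictors (sketch)
pander(cbind(rbc = table(training_imp$rbc), pc = table(training_imp$pc)))
pander(cbind(pcc = table(training_imp$pcc), ba = table(training_imp$ba)))
pander(sapply(training_imp[c("htn", "dm", "cad", "pe", "ane")], table))
pander(table(training_imp$appet))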
The following grid of bar charts aims to show the target distribution across each binary variable (a plotting sketch follows below). Looking at the parameters related to infections, such as ba (bacteria) and pcc (pus cell clumps), these are less frequent in the overall dataset, but when they do occur, they are predominantly found in patients with kidney disease (class = ckd). In addition, poor appetite seems to be a common symptom in patients with ckd. Hypertension (htn) and diabetes mellitus (dm) are variables that concentrate a high number of sick patients.
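A sketch of this grid, assuming ggplot2 bar charts arranged with grid.arrange as elsewhere in the report:
# Target distribution across each binary variable (sketch)
bin_vars <- c("rbc", "pc", "pcc", "ba", "htn", "dm", "cad", "appet", "pe", "ane")
plots <- lapply(bin_vars, function(v) {
  ggplot(training_imp, aes(x = .data[[v]], fill = class)) +
    geom_bar(position = "dodge") +
    labs(x = v)
})
grid.arrange(grobs = plots, ncol = 5)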
Categorical variables.
The next step consists of exploring the \(3\) multi-state variables: specific gravity (sg), albumin (al), and sugar (su). Specific gravity clearly shows that levels \(1.020\) and \(1.025\) are the common levels for healthy people. Conversely, lower states are more prevalent in the presence of chronic kidney disease. On the other hand, the healthy population is expected to have a level \(0\) of albumin as well as a low level of sugar; otherwise, the levels indicate a strong link to this chronic disease.
Additionally, in order to confirm the above insights, some stacked bar charts have been coded (a sketch follows below). These plots show the proportion of people with and without chronic kidney disease across the different categories of these three multi-state variables. All patterns are confirmed.
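A sketch of one of these stacked bar charts, shown here for sg (al and su follow the same pattern):
# Proportion of ckd/notckd within each specific gravity level (sketch)
ggplot(training_imp, aes(x = sg, fill = class)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")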
The heatmap represents the strength and direction of the relationships between the variables in the dataset, where the color intensity indicates the magnitude of the correlation. Dark colors, either blue (positive) or red (negative), suggest a stronger relationship between variables. A positive coefficient means that an increase (or decrease) in one variable is associated with an increase (or decrease) in the other. Conversely, a negative correlation indicates that as one variable increases, the other decreases. Colors in the middle of the scale are interpreted as weak correlations.
To deal with mixed data types, the hetcor function from the polycor package in R is used. This enables the computation of correlations across numeric, binary and multi-state variables, as long as the qualitative variables are represented as factors. When coding this step, some inconveniences were encountered with the al and htn variables. To address this issue, both parameters are treated numerically to facilitate their computation in the correlation analysis.
training_modify <- training_transf
# Some transformation needed for handling problematic variables: al, htn
training_modify$al <- as.numeric(as.character(training_modify$al))
#contrasts(training_transf$htn) # no as 0 and yes as 1
training_modify$htn <- as.numeric(training_modify$htn)
# Works for mix variables: binary, multi-state and numeric
# Qualitative variables have to be in factor
cor_matrix <- hetcor(training_modify)
cor_values <- cor_matrix$correlations
corrplot(cor_values, method="color", type="upper", order="hclust",
tl.cex=0.7, tl.col="black", tl.srt=45)
Analytical exploration.
Thanks to this exploration, \(17\) strong relationships were discovered (the threshold for filtering the correlations is set to coefficients higher than \(0.7\) or lower than \(-0.7\)). However, I decided to focus on the top \(8\) with the most significant impact, which can be seen in the table below.
Most of these relationships are fairly intuitive. For example, having diabetes is closely linked to higher sugar levels, and a lower amount of hemoglobin seems to be associated with anemia. Overall, it shows how related the metabolic and blood measures are when identifying chronic kidney disease.
cor_flat <- as.data.frame(as.table(cor_values))
cor_flat <- cor_flat[order(-abs(cor_flat$Freq)),]
# Filter correlations: >0.7 or <-0.7
strong_correlations <- subset(cor_flat, abs(Freq) > 0.7 & Var1 != Var2)
# Remove duplicates rows
strong_correlations$pairID <- apply(strong_correlations[, c("Var1", "Var2")],
1,
function(x) paste(sort(x), collapse = "-"))
strong_correlations <- strong_correlations[!duplicated(strong_correlations$pairID),]
strong_correlations$pairID <- NULL # Remove ID
# Add sign
strong_correlations$Sign <- ifelse(strong_correlations$Freq > 0, "+", "-")
#pander(strong_correlations) # 17 strong relationships
head(strong_correlations, 8) # Display the highest 8 relations
## Var1 Var2 Freq Sign
## 366 pcv hemo 0.8738227 +
## 158 pcc pc -0.8393903 -
## 175 class pc 0.8240179 +
## 375 class hemo 0.8182676 +
## 120 dm su 0.8176266 +
## 374 ane hemo -0.7961312 -
## 400 class pcv 0.7805154 +
## 500 class dm -0.7779668 -
To see the distribution of our target variable (class), we first look at the raw counts and then at the percentages to understand the proportion of sick patients. To conclude, it is worth mentioning that the output variable is not severely imbalanced, so there is no need for upsampling or downsampling.
# Check counts of chronic kidney disease
pander(table(data$class))
| ckd | notckd |
|---|---|
| 243 | 150 |
# Results in percentage
pander(prop.table(table(data$class)) * 100)
| ckd | notckd |
|---|---|
| 61.83 | 38.17 |
The goal of this section is to evaluate the performance of several machine learning tools on the binary classification task. The aim is to predict whether a patient has chronic kidney disease or not. For each of these models, we will compute the accuracy, the kappa, and the confusion matrix, a table which indicates how many instances were correctly classified and how many were misclassified. Finally, it is worth mentioning that for this second part of the project some common considerations are defined that I apply to all implemented models.
First, training and testing datasets are divided into features and the target variable.
# Divide into features and target both subsets
X_train <- training_transf %>% dplyr::select(-class)
y_train <- training_transf$class
X_test <- testing_transf %>% dplyr::select(-class)
y_test <- testing_transf$class
n_test <- dim(testing_transf)[1]
To ensure a robust evaluation, a train control parameter is configured: \(5\)-fold cross-validation repeated \(10\) times to optimize the hyper-parameters.
# Set up train control -> 5-folds cross validation repeated 10 times
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 10,
verboseIter = TRUE, # TRUE for tracking
classProbs = TRUE)
On the other hand, caret (Kuhn and Max 2008) is the package used to code all the models. In order to understand the capabilities of each available model and its tunable parameters, the following commands have been used:
# Models supported by caret
names(getModelInfo())
# In order to explore which parameter can be tune
name_caret_model <- "knn" # Change me!
modelLookup(name_caret_model)
Lastly, due to the time-consuming hyper-parameter search, the best-performing models are pre-trained, saved, and stored in a dedicated folder (named pretrained_models). This not only saves time during the analysis but also ensures the reproducibility of the results for the reader.
# Loading pre-trained models from pretrained_models folder
# KNN model --------------------------------------------------------------------
knn_fit <- readRDS("pretrained_models/knn_fit.rds")
# SVM model --------------------------------------------------------------------
svm_fit <- readRDS("pretrained_models/svm_fit.rds")
# DT model ---------------------------------------------------------------------
rpart_fit <- readRDS("pretrained_models/rpart_fit.rds")
c50_fit <- readRDS("pretrained_models/c50_fit.rds")
# RF model ---------------------------------------------------------------------
rf_fit_final <- readRDS("pretrained_models/rf_fit_final.rds")
# GB model ---------------------------------------------------------------------
xgb_fit <- readRDS("pretrained_models/xgb_fit.rds")
# NN model ---------------------------------------------------------------------
nn_fit <- readRDS("pretrained_models/nn_fit.rds")
# DNN model ---------------------------------------------------------------------
dnn_fit <- readRDS("pretrained_models/dnn_fit.rds")
The k-Nearest Neighbors classifier, commonly known as kNN, looks at the closest training examples in the feature space and uses a majority vote to determine the target. One of its advantages is that it makes no assumptions about the underlying data distribution, which makes it well suited for capturing non-linear patterns.
The main parameter of kNN is \(k\), which allows us to set the number of neighbors to consider for obtaining an optimal accuracy. After a search over this hyper-parameter, the best performance was achieved with \(k = 3\), indicating that the model performs well when making predictions based on the three closest neighbors.
# Hyper-parameter search -------------------------------------------------------
# K no more than 50% of the training set size.
# From lecture: preferable to try odd numbers for k
knn_grid <- expand.grid(k = seq(3, 121, by = 2))
dim(knn_grid) # 60 combinations
# Training the model -----------------------------------------------------------
knn_fit <- train(class ~ .,
method = "knn",
data = training_transf,
tuneGrid = knn_grid,
metric = "Accuracy",
trControl = ctrl)
print(knn_fit)
# Best values: k = 3
# Save model -------------------------------------------------------------------
saveRDS(knn_fit, file = "pretrained_models/knn_fit.rds")
# Predictions on test ----------------------------------------------------------
knn_pred <- predict(knn_fit, X_test)
knn_pred_prob <- predict(knn_fit, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(knn_pred, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 39 0
## notckd 9 30
##
## Accuracy : 0.8846
## 95% CI : (0.7922, 0.9459)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 1.186e-07
##
## Kappa : 0.7692
##
## Mcnemar's Test P-Value : 0.007661
##
## Sensitivity : 0.8125
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.7692
## Prevalence : 0.6154
## Detection Rate : 0.5000
## Detection Prevalence : 0.5000
## Balanced Accuracy : 0.9062
##
## 'Positive' Class : ckd
##
When unseen data is used with the best model obtained (hyper-parameters already selected), the model achieved an accuracy of \(88.46\)%, with a \(95\)% confidence interval ranging from \(79.22\)% to \(94.59\)%. As expected, the kappa metric is slightly lower; this measure accounts for the accuracy that could be achieved by random chance, making it a more robust metric than accuracy alone.
Furthermore, by analyzing the confusion matrix we can understand the model's mistakes. There are no false positives, which is an advantage considering that this type of error can be costly for the healthcare center. However, the worst mistake, from a life standpoint, occurs when predicting no disease for a patient who actually has it, and this is where all the errors are concentrated.
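Although the ROC curve is not among the reported metrics, the pROC library loaded at the beginning could complement this evaluation; a sketch using the class probabilities computed above:
# ROC curve and AUC for the kNN model (sketch)
knn_roc <- roc(response = y_test, predictor = knn_pred_prob$ckd,
               levels = c("notckd", "ckd"))
auc(knn_roc)
plot(knn_roc)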
Performance of the model.
The hyper-parameter tuning plot illustrates the relationship between the number of neighbors and the model's performance metrics, accuracy and kappa. As \(k\) increases, these two metrics decrease, which tells us that too many neighbors make the model less accurate. The highest metric values are reached at \(k = 3\).
The variable importance plot for the k-Nearest Neighbors model displays the relative importance of each predictor in the classification task. As can be observed, hemoglobin (hemo) stands out as the most significant variable when forecasting the presence or absence of chronic kidney disease. Nevertheless, we cannot ignore packed cell volume (pcv), specific gravity (sg), serum creatinine (sc) or red blood cell count (rbcc), which remain crucial for the model's decision-making.
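Both plots can be reproduced with caret's built-in methods, following the plotting convention used for the other models in this report:
# Hyper-parameter tuning plot (accuracy vs k)
plot(knn_fit, main = "Hyper-parameter tuning - KNN")
# Variable importance plot
plot(varImp(knn_fit), main = "Variable importance - KNN")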
Support Vector Machines, also known as SVM, is a classifier based on the idea of maximizing the margin between different classes. Kernel functions play an important role in these models and work well with non-linear problems. Additionally, SVM is less influenced by the presence of outliers, which can be beneficial for this dataset, and it is well known for reducing the risk of overfitting. In this analysis, we have used the radial basis function kernel in caret (svmRadial).
# Hyper-parameter search -------------------------------------------------------
# C; compromise between low errors and weight values
# sigma; determine how far the influence of a single training example reaches
svm_grid <- expand.grid(C = c(1e-1, 1e0, 1e1, 1e2, 1e3, 1e4),
sigma = c(0.01, 0.05, 0.1, 0.5, 1, 2, 5)
)
dim(svm_grid) # 42 combinations
# Training the model -----------------------------------------------------------
svm_fit <- train(class ~., method = "svmRadial",
data = training_transf,
tuneGrid = svm_grid,
metric = "Accuracy",
trControl = ctrl)
print(svm_fit)
# Best values: sigma = 0.5 and C = 1
# Save model -------------------------------------------------------------------
saveRDS(svm_fit, file = "pretrained_models/svm_fit.rds")
# Predictions on test ----------------------------------------------------------
svm_pred <- predict(svm_fit, X_test)
svm_pred_prob <- predict(svm_fit, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(svm_pred, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 48 3
## notckd 0 27
##
## Accuracy : 0.9615
## 95% CI : (0.8917, 0.992)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 7.08e-13
##
## Kappa : 0.9172
##
## Mcnemar's Test P-Value : 0.2482
##
## Sensitivity : 1.0000
## Specificity : 0.9000
## Pos Pred Value : 0.9412
## Neg Pred Value : 1.0000
## Prevalence : 0.6154
## Detection Rate : 0.6154
## Detection Prevalence : 0.6538
## Balanced Accuracy : 0.9500
##
## 'Positive' Class : ckd
##
The SVM model reached remarkable metrics: an accuracy of \(96.15\)% and a kappa of \(91.72\)%, which represents an almost perfect model. The confusion matrix reveals that the worst kind of mistake (predicting that a patient does not have chronic kidney disease when they actually do) never occurs. However, there are a few cases where the model forecasts that the patient has ckd while in reality the patient is healthy.
Performance of the model.
During the training phase, we conducted a hyper-parameter search to balance the trade-off between ensuring low error (parameter C) and managing the influence range of the support vectors (sigma). Of all the values tried, the model performs best with \(sigma = 0.5\) and \(C = 1\).
The parameter search space plot suggests that smaller values of sigma lead to better models. Moreover, with an appropriate sigma, it seems the choice of \(C\) is less critical to the model's performance.
The variable importance plot again highlights hemoglobin (hemo) as the most significant variable when forecasting the presence or absence of chronic kidney disease. Other important parameters include packed cell volume (pcv), specific gravity (sg) and serum creatinine (sc). As an insight, SVM aligns with kNN in terms of the key variables; the only difference is in the ranking of the pe and rbc variables.
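A sketch of the corresponding plots for the SVM model; the level-plot view of the search space is an assumption about how the plot was produced:
# Search space over sigma and C as a level plot
plot(svm_fit, plotType = "level", main = "Hyper-parameter tuning - SVM")
# Variable importance plot
plot(varImp(svm_fit), main = "Variable importance - SVM")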
In general, decision trees are well suited for datasets with missing values or outliers. Although this dataset no longer contains NA's, using decision trees can still be advantageous due to the presence of outliers in some of the variables. Furthermore, their ability to perform automatic feature selection eases their interpretability. For this kind of model we explore two options provided by the caret library.
RPART MODEL. The RPART method is a type of decision tree which offers simplicity and interpretability, without support for boosting.
# Hyper-parameter search -------------------------------------------------------
# cp; size of the tree
rpart_grid <- expand.grid(.cp = c(0.01, 0.05, 0.1))
# Training the model -----------------------------------------------------------
rpart_fit <- train(class ~.,
method = "rpart",
data = training_transf,
tuneGrid = rpart_grid,
metric = "Accuracy",
trControl = ctrl)
print(rpart_fit)
# Best values: cp = 0.01
# Save model -------------------------------------------------------------------
saveRDS(rpart_fit, file = "pretrained_models/rpart_fit.rds")
# Predictions on test ----------------------------------------------------------
rpart_pred <- predict(rpart_fit, X_test)
rpart_pred_prob <- predict(rpart_fit, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(rpart_pred, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 47 1
## notckd 1 29
##
## Accuracy : 0.9744
## 95% CI : (0.9104, 0.9969)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 4.373e-14
##
## Kappa : 0.9458
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9792
## Specificity : 0.9667
## Pos Pred Value : 0.9792
## Neg Pred Value : 0.9667
## Prevalence : 0.6154
## Detection Rate : 0.6026
## Detection Prevalence : 0.6154
## Balanced Accuracy : 0.9729
##
## 'Positive' Class : ckd
##
The evaluation of this model on the testing set achieved an almost perfect model, with an accuracy of about \(97.44\)% and a kappa of \(94.58\)%. The model's confusion matrix reveals that the misclassification is minimal, with only two incorrect predictions (one for each type of error).
Performance of the model.
The same accuracy and kappa metrics were obtained across the different complexity parameters (cp). Therefore, the visual representation of the tuning process is omitted, as it does not enhance the understanding of the model's performance. Instead, I will focus on the decision tree diagram coded thanks to the rattle package (Williams 2011); a plotting sketch follows the chunk below. This interesting plot illustrates the decision-making process of the model. It starts with hemoglobin as the primary node, splitting the data at a threshold of \(0.17\). The second node then bases its decision on the sg variable.
# Plot tuning hyper-parameter --------------------------------------------------
plot(rpart_fit, main = 'Hyper-parameter tuning - DT rpart')
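The tree diagram itself is presumably drawn from the final rpart model with rattle's fancyRpartPlot():
# Decision tree diagram of the final rpart model (sketch)
fancyRpartPlot(rpart_fit$finalModel, main = "Decision tree - rpart")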
In terms of variable importance, the rpart model selected \(7\) variables out of \(24\) as relevant parameters. This represents less than \(30\)% of the columns in the dataset. However, once again, hemo belongs to the top \(3\) most important variables.
TREE MODEL. The second model coded is the C5.0 algorithm, an advanced decision tree that can incorporate a boosting approach.
# Hyper-parameter search -------------------------------------------------------
# .winnow; feature selection step conducted before modelling
# .trial; number of boosting iterations
c50_grid <- expand.grid( .winnow = c(TRUE, FALSE),
.trials=c(1, 5, 4),
.model= "tree"
)
dim(c50_grid)
# Training the model -----------------------------------------------------------
c50_fit <- train(class ~.,
method = "C5.0",
data = training_transf,
tuneGrid = c50_grid,
metric = "Accuracy",
trControl = ctrl)
print(c50_fit)
# Best values: trials = 1, model = tree and winnow = FALSE
# Save model -------------------------------------------------------------------
saveRDS(c50_fit, file="pretrained_models/c50_fit.rds")
# Predictions -----------------------------------------------------------------
c50_pred <- predict(c50_fit, X_test)
c50_pred_prob <- predict(c50_fit, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(c50_pred, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 47 1
## notckd 1 29
##
## Accuracy : 0.9744
## 95% CI : (0.9104, 0.9969)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 4.373e-14
##
## Kappa : 0.9458
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9792
## Specificity : 0.9667
## Pos Pred Value : 0.9792
## Neg Pred Value : 0.9667
## Prevalence : 0.6154
## Detection Rate : 0.6026
## Detection Prevalence : 0.6154
## Balanced Accuracy : 0.9729
##
## 'Positive' Class : ckd
##
Upon application to the testing set, the C5.0 model achieved the same results as the previous decision tree model, rpart: on new data it produces the same confusion matrix, accuracy and kappa values. Hence, we can conclude there is no difference in how rpart and C5.0 perform when unseen data is introduced. Even though these evaluation metrics are the same, let's explore the differences between these two models when being built.
Performance of the model.
The model tuning revealed that setting the winnowing parameter to FALSE consistently yields better results, independently of the number of boosting iterations used.
Variable importance for the C5.0 model shows a limited set of variables, with only four out of twenty-four columns being significant. Hemoglobin (hemo) is the most critical feature and specific gravity (sg) follows as the second most important. The model also gives some relevance to blood glucose random (bgr) and albumin (al), although to a lesser degree.
Random forests are an evolution of decision trees, designed to enhance model performance by reducing overfitting. They involve creating multiple decision trees and aggregating their predictions in order to produce the final prediction. The caret package only allows tuning the mtry parameter; however, I want to illustrate that it is possible to expand the search manually to other parameters such as the ntree variable.
# Hyper-parameter search -------------------------------------------------------
# mtry; nº of variables randomly chosen for building the trees
# Common practice for the grid is [1, sqrt(n_variables)]
rf_grid <- expand.grid(mtry = c(1, 2, 3, 4, 5))
# Training the model -----------------------------------------------------------
rf_fit <- train(class ~.,
method = "rf",
data = training_transf,
ntree = 200,
tuneGrid = rf_grid,
metric = "Accuracy",
trControl = ctrl)
print(rf_fit)
# Best values: mtry = 2
# Try different ntree values manually ------------------------------------------
model_rf_list <- list()
# Values I want to test (my search-space)
testing_ntree_var <- seq(200, 1000, by = 100)
for (ntree in testing_ntree_var){
fit <- train(class~.,
method = 'rf',
data = training_transf,
ntree = ntree,
tuneGrid = rf_grid,
metric = 'Accuracy',
trControl = ctrl)
key <- toString(ntree)
model_rf_list[[key]] <- fit
}
results_rf_manually <- resamples(model_rf_list)
summary(results_rf_manually)
dotplot(results_rf_manually)
# Best values: ntree = 1000
# Final search with ntree=1000 -------------------------------------------------
rf_fit_final <- train(class ~.,
method = "rf",
data = training_transf,
ntree = 1000,
tuneGrid = rf_grid,
metric = "Accuracy",
trControl = ctrl)
print(rf_fit_final)
# Save model -------------------------------------------------------------------
saveRDS(rf_fit_final, file = "pretrained_models/rf_fit_final.rds")
# Predictions on test ----------------------------------------------------------
rf_predF <- predict(rf_fit_final, X_test)
rf_predF_prob <- predict(rf_fit_final, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(rf_predF, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 48 0
## notckd 0 30
##
## Accuracy : 1
## 95% CI : (0.9538, 1)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6154
## Detection Rate : 0.6154
## Detection Prevalence : 0.6154
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : ckd
##
When testing the final model on the test set, the random forest algorithm showed the most promising results so far, reflecting the strength of the learning. Indeed, both the accuracy and the kappa equal \(1\), which might indicate that this model requires a more complex dataset and/or that the model is overfitting the test data.
Performance of the model.
The visualization of the hyper-parameter tuning is limited to mtry. We can observe an increase in accuracy when going from one to two predictors, indicating that considering at least two features at each split significantly enhances the model's performance. It is also clear that higher values do not improve the model further.
The variable importance plot again highlights hemoglobin (hemo) as a primary feature, consistent with earlier findings from the other models. This reinforces the idea that hemoglobin levels are probably the most important biomarker in the diagnosis of chronic kidney disease. The model also identifies other key variables that contribute significantly, such as pcv or sc.
Gradient Boosting is an advanced ensemble technique that builds a series of decision trees sequentially. Like other tree-based models, it handles outliers well, as well as any type of predictor. However, it can overfit quickly and be time-consuming; in fact, it is the most computationally expensive method used in this project. In caret we are able to tune several hyper-parameters, including the learning rate and the tree depth, among others.
# Hyper-parameter search -------------------------------------------------------
xgb_grid = expand.grid(
nrounds = c(100, 500),
eta = c(0.01, 0.1), # Learning rate
max_depth = c(3, 6), # Tree depth
gamma = c(0, 1),
colsample_bytree = c(0.5, 0.8), # Feature sampling
min_child_weight = c(1, 5),
subsample = c(0.7, 0.9)
)
dim(xgb_grid) # 128 combinations
# Training the model -----------------------------------------------------------
xgb_fit = train(class ~.,
data = training_transf,
tuneGrid = xgb_grid,
metric = "Accuracy",
method = "xgbTree",
trControl = ctrl
)
print(xgb_fit)
# Best values: nrounds = 100, max_depth = 6, eta = 0.1, gamma = 1, colsample_bytree = 0.8, min_child_weight = 1 and subsample = 0.9
# Save model -------------------------------------------------------------------
saveRDS(xgb_fit, file="pretrained_models/xgb_fit.rds")
# Predictions on test ----------------------------------------------------------
xgb_pred = predict(xgb_fit, X_test)
xgb_pred_prob = predict(xgb_fit, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(xgb_pred, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 47 1
## notckd 1 29
##
## Accuracy : 0.9744
## 95% CI : (0.9104, 0.9969)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 4.373e-14
##
## Kappa : 0.9458
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9792
## Specificity : 0.9667
## Pos Pred Value : 0.9792
## Neg Pred Value : 0.9667
## Prevalence : 0.6154
## Detection Rate : 0.6026
## Detection Prevalence : 0.6154
## Balanced Accuracy : 0.9729
##
## 'Positive' Class : ckd
##
The model's performance on the test set yielded an accuracy of \(97.44\)% and a kappa of \(94.58\)%. These high values suggest that xGB has achieved a near perfect fit to the data. Only a few mistakes were committed: one false positive and one false negative.
Performance of the model.
Due to the high accuracy and kappa of this model, the hyperparameter tuning plot does not show a clear pattern that distinguishes between the configurations. Therefore, it is not included in the final report.
# Plot hyper-parameter tuning --------------------------------------------------
plot(xgb_fit, main = 'Hyper-parameter tuning - XGB')
We can confirm the dominance of hemoglobin (hemo) as the most critical predictive feature, as seen with the other models. Specific gravity (sg) also shows high importance, underlining the consistency across different modeling approaches in identifying the key factors for predicting the presence or absence of the disease.
Neural Networks offer the ability to capture complex non-linear relationships between variables, but they are computationally expensive and can easily lead to overfitting. Their "worst" aspect is the difficulty of interpreting them, which is why they are often called black-box tools.
# Hyper-parameter search -------------------------------------------------------
# size; nº of neurons in the hidden layer
nn_grid <- expand.grid(size=c(2,4,6),
decay=c(0.01,0.001))
dim(nn_grid)
# Training the model -----------------------------------------------------------
# NN with 1 hidden layer
nn_fit <- train(class ~.,
method = "nnet",
data = training_transf,
MaxNWts = 1000,
maxit = 100,
tuneGrid = nn_grid,
metric = "Accuracy",
trControl = ctrl)
print(nn_fit)
# Best values: size = 6 and decay = 0.001
# Save model -------------------------------------------------------------------
saveRDS(nn_fit, file="pretrained_models/nn_fit.rds")
# Predictions on test ----------------------------------------------------------
nn_pred = predict(nn_fit, X_test)
nn_pred_prob = predict(nn_fit, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(nn_pred, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 47 0
## notckd 1 30
##
## Accuracy : 0.9872
## 95% CI : (0.9306, 0.9997)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 1.779e-15
##
## Kappa : 0.9731
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9792
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9677
## Prevalence : 0.6154
## Detection Rate : 0.6026
## Detection Prevalence : 0.6026
## Balanced Accuracy : 0.9896
##
## 'Positive' Class : ckd
##
The confusion matrix shows how well this model is working, given its high accuracy and kappa. The errors made are minimal; one mistake occurs when predicting that a patient does not have ckd when in reality the patient is sick (has ckd). This mistake represents the worst error that our model can commit.
Performance of the model.
The following pattern can be observed: when the weight decay is fixed at \(0.001\), the accuracy improves as the number of hidden neurons increases from \(2\) to \(6\). However, this trend is not strictly followed when the weight decay is fixed at \(0.01\). It is therefore clear that varying one parameter affects the configuration of the other search variables.
Finally, we can identify an unexpected behavior in the variable importance plot for this model: the age variable is ranked as the most influential parameter. This makes no sense when compared with the previous results. Such an anomaly leads us to a re-examination of the data preprocessing steps, particularly the scaling of the input variables, where the mistake can be found.
In neural networks, feature scaling has a significant impact, since it makes all inputs contribute equally to the model's learning process. By not scaling the age input we made this variable appear more important than it is. Despite this issue, hemo remains the second most important variable in the dataset, which aligns with the results obtained so far. A sketch of the fix is shown below.
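A minimal sketch of the fix, scaling age in both subsets just as the other numeric inputs were scaled in the data transformation code:
# Hypothetical fix: scale the age variable as well
training_transf$age <- as.numeric(scale(training_transf$age))
testing_transf$age <- as.numeric(scale(testing_transf$age))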
Deep Neural Networks extend traditional neural networks by incorporating multiple hidden layers, offering the ability to model highly complex relationships within the dataset. We can adjust these hidden layers to make more accurate predictions.
# Hyper-parameter search -------------------------------------------------------
# layer1, layer2, layer3; nº of neurons in each hidden layer
dnn_grid <- expand.grid(layer1 = c(5, 10, 15),
layer2 = c(0, 5, 10),
layer3 = c(0, 5),
hidden_dropout = c(0.1, 0.2), # Dropout fraction for hidden units
visible_dropout = c(0.1, 0.2)) # Dropout fraction for input units
# dim(dnn_grid) # 72 combinations
# Training the model -----------------------------------------------------------
# DNN with up to 3 hidden layers
dnn_fit <- train(class ~.,
method = "dnn",
data = training_transf,
numepochs = 20, # nº of iterations on the whole training set
tuneGrid = dnn_grid,
metric = "Accuracy",
trControl = ctrl)
print(dnn_fit)
# Best values: layer1 = 15, layer2 = 0, layer3 = 0, hidden_dropout = 0.2 and visible_dropout = 0.1
# Save model -------------------------------------------------------------------
saveRDS(dnn_fit, file="pretrained_models/dnn_fit.rds")
# Predictions on test ----------------------------------------------------------
dnn_pred = predict(dnn_fit, X_test)
dnn_pred_prob = predict(dnn_fit, X_test, type = "prob")
# Performance measures ---------------------------------------------------------
confusionMatrix(dnn_pred, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction ckd notckd
## ckd 41 1
## notckd 7 29
##
## Accuracy : 0.8974
## 95% CI : (0.8079, 0.9547)
## No Information Rate : 0.6154
## P-Value [Acc > NIR] : 2.367e-08
##
## Kappa : 0.7912
##
## Mcnemar's Test P-Value : 0.0771
##
## Sensitivity : 0.8542
## Specificity : 0.9667
## Pos Pred Value : 0.9762
## Neg Pred Value : 0.8056
## Prevalence : 0.6154
## Detection Rate : 0.5256
## Detection Prevalence : 0.5385
## Balanced Accuracy : 0.9104
##
## 'Positive' Class : ckd
##
With this model, we can appreciate that the metrics are a little lower: an accuracy of \(89.74\)% and a kappa of \(79.12\)%, although they are still quite high. Focusing on the error performance, we realise that this model makes most of its mistakes on what we have defined as the worst type of error (predicting notckd when the patient actually has ckd). Hence, in terms of errors, we can conclude this is our worst model.
Performance of the model.
The following plots showcase how the model's accuracy responds to changes in the network's architecture. Generally, increasing the number of neurons seems to increase accuracy.
The variable importance plot indicates that deep neural networks consider most of the variables relevant, in contrast with other models such as the decision trees. Nevertheless, the same features as before occupy the top positions (hemo, pcv and sg).
The aim of this section is to provide a brief summary of the performance of all models implemented in this project. In general, the models have demonstrated outstanding results, with strong consistency across metrics performance such as the accuracy and the kappa.
The best performance is achieved by the support vector machine and the random forest, with RF reaching a perfect accuracy of \(1\) on the testing set. However, SVM displays lower variance during validation, which suggests it may offer more reliable behavior across different data splits. It is worth mentioning the presence of outliers in the decision tree using rpart as well as in xGB from gradient boosting; this might indicate higher variability on the training data, so their behavior can sometimes be unexpected. Exploring the dotplot, it is clear that most of the models, with the exception of kNN and DNN, achieve very high accuracy. While high accuracy is desirable, it could also indicate that the models are overfitting, especially when the dataset is not complex enough, which may well be the case here. Additionally, both decision trees implemented (rpart and C5.0) and the gradient boosting yielded identical performance metrics on the test set, hence they have similar predictive capabilities.
On the other hand, the neural network model has not worked
as well as I was expecting. Nevertheless, this can be traced to a
preprocessing typo where the age variable was not scaled.
While correcting this would likely change the model outcomes, the
oversight has underscored the relevance of feature scaling in model
preprocessing, especially for models that are highly sensitive to
the magnitude of the inputs.
While accuracy is often our primary focus, it is essential to recognize that it may not always be the most beneficial metric. In the Social impact section, we will see the advantages of prioritizing models that, while less accurate, produce better outcomes from a social perspective.
Finally, a special section named Scores on the testing set has been included, where all the scores obtained on the testing set can be visualized for each model.
# Compare model performance ----------------------------------------------------
models_compare <- resamples(list(KNN = knn_fit,
                                 SVM = svm_fit,
                                 DT.rpart = rpart_fit,
                                 DT.c50 = c50_fit,
                                 RF = rf_fit_final,
                                 xGB = xgb_fit,
                                 NN = nn_fit,
                                 DNN = dnn_fit))
# Analytical summary
# summary(models_compare)
# Box-and-Whisker Plot
bwplot(models_compare, metric = "Accuracy",
       main = "Model comparison based on Accuracy")
# Dotplot
dotplot(models_compare, metric = "Accuracy",
        main = "Model Comparison based on Accuracy")
This project focused on analyzing the presence or absence of chronic
kidney disease based on \(24\) variables.
The initial step centered on understanding the dataset as well as
preparing it for the machine learning tools. It is worth mentioning the
utilization of the mice.reuse function for data imputation,
which enabled the generation of more reliable values for the NA’s.
Throughout this task, a total of \(8\) models were developed, and their
performance was evaluated across various metrics. While detailed
insights have been discussed in the Overview models section, it is worth repeating
some key ideas. Decision tree based models yielded similar results,
with three of them producing identical confusion matrices on the test
set. The Random Forest model achieved perfect accuracy, making it
the best model. A rare and unexpected behavior in the Deep Neural
Network highlighted a preprocessing typo, where the variable
age was not scaled, affecting the model’s performance.
Despite this, all models demonstrated exceptional accuracy and kappa
scores.
The last part aims to incorporate the social impact into the kNN model, demonstrating that not all errors are equal in their implications for patient outcomes. For instance, the “worst” mistake is predicting that a patient does not have chronic kidney disease when they actually do, which can clearly affect the patient’s quality of life. By incorporating social impact considerations into our evaluation, we illustrated that a model with slightly lower accuracy could offer greater benefits by minimizing the worst errors. This is done by adjusting the prediction threshold.
In summary, this project highlighted the necessity of thorough
preprocessing and a deep understanding of the data we are dealing with.
We leveraged a variety of R tools to analyze and visualize
the machine learning models and their behaviors (e.g., variable
importance plots), enriching our insights into the models.
Disclaimer. The full source code, pre-trained models, and adjusted dataset can be found on my GitHub repository.
Until now, we have always explored the confusion matrix, which is a useful analytical tool. However, a more visual alternative can be generated by plotting, for every model, the scores of the predicted values for each patient. For clarity, all individuals have been grouped by the two possible outcomes. Each point represents a prediction from the model being evaluated, where blue points are the healthy patients and red points are those with chronic kidney disease. The dashed horizontal line at 0.5 serves as the decision threshold, indicating the level at which the model switches between the two possible outcomes. This gives a visual representation of how the model is predicting.
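A minimal sketch of such a plot, assuming the scores data frame is built from one model’s probabilities; the name scores_df and its columns are illustrative, not the project’s actual objects:
# Scores data frame built from one model's probabilities (illustrative)
scores_df <- data.frame(patient = seq_along(y_test),
                        prob_ckd = dnn_pred_prob[, "ckd"],
                        class = y_test)
ggplot(scores_df, aes(x = patient, y = prob_ckd, colour = class)) +
  geom_point() +
  scale_colour_manual(values = c(ckd = "red", notckd = "blue")) +
  geom_hline(yintercept = 0.5, linetype = "dashed") + # decision threshold
  facet_wrap(~ class) + # group individuals by their true outcome
  labs(y = "Predicted probability of ckd", title = "Scores on the testing set")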
ROC curves for all models.
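Since the pROC library is already loaded, a hedged sketch of one such curve follows (shown for the DNN probabilities from above; the remaining models follow the same pattern):
# ROC curve for a single model; other models are analogous
dnn_roc <- roc(response = y_test,
               predictor = dnn_pred_prob[, "ckd"],
               levels = c("notckd", "ckd"))
plot(dnn_roc, main = "ROC curve - DNN")
auc(dnn_roc) # area under the curve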
6 Social impact
From a medical point of view, accurately diagnosing chronic kidney disease is an important decision due to the severe implications associated with its misdiagnosis. This section focuses on the social impact analysis using the k-Nearest Neighbors (kNN) model, chosen due to its tendency to commit a relatively high number of critical errors, failing to detect the presence of ckd in affected patients.
When a model predicts wrongly, we must take into account that not all mistakes carry the same social consequences. Incorrectly identifying a patient as having ckd may lead to unnecessary stress and medical interventions, among others. However, failing to diagnose the illness when the patient is actually suffering from ckd can have far greater repercussions: it leads to the absence of medical treatment, significantly decreasing the patient’s quality of life as well as their life expectancy. Therefore, we can conclude that some mistakes are “worse” than others, and in our situation this error occurs when the model is not able to identify the presence of ckd in a patient who actually has it.
For the following analysis, some assumptions need to be made.
If the model correctly diagnoses a patient as not having chronic kidney disease, this results in no change to their quality of life (still \(100\)% quality of life).
If the model correctly diagnoses that a patient has chronic kidney disease, this leads to the proper treatment, maintaining their quality of life (\(100\)% quality of life).
If the model predicts a patient has ckd when they are healthy, this leads to a decrease in quality of life due to unnecessary medical treatment and psychological stress, although the extra attention paid to the patient may be a slight advantage (\(-3\)% quality of life).
If the model fails to diagnose chronic kidney disease in an affected patient, the consequences are huge because of the lack of treatment (a \(90\)% decrease in quality of life).
Given these assumptions, our goal is to adjust the model’s classification threshold to minimize the most socially costly errors (false negatives). This involves a series of steps where we iteratively adjust the threshold for classifying a patient as having ckd and evaluate the social impact based on the quality of life adjustments defined above.
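A minimal sketch of this search, assuming knn_pred_prob holds the kNN class probabilities on the test set and y_test the true labels (both names are assumptions for this sketch). Here the threshold is interpreted as the confidence required to declare a patient healthy, so a high threshold is conservative against false negatives:
# Quality-of-life outcome for each (prediction, truth) pair,
# following the assumptions listed above
qol <- function(pred, truth) {
  ifelse(pred == truth, 1.00,        # correct diagnosis: 100%
         ifelse(pred == "ckd", 0.97, # false positive: -3%
                0.10))               # false negative: -90%
}
thresholds <- seq(0.05, 0.95, by = 0.05)
social_impact <- sapply(thresholds, function(th) {
  # Predict notckd only when its probability reaches the threshold
  pred <- factor(ifelse(knn_pred_prob[, "notckd"] >= th, "notckd", "ckd"),
                 levels = levels(y_test))
  mean(qol(pred, y_test)) # average quality of life across patients
})
plot(thresholds, social_impact, type = "b",
     xlab = "Threshold", ylab = "Mean quality of life")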
The width of the boxes in the boxplot represents the variability of the unit social impact across the different runs of the hyper-parameter tuning. A smaller width means a more reliable model performance. Therefore, in this situation, the optimal threshold is \(0.95\).
Finally, once we have computed the optimal threshold, we can use it in our model to make predictions over the test set. This yields a model with worse accuracy, but the quality of life of the patients is improved.
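Applying the selected threshold is then a one-liner on the predicted probabilities (again a sketch under the same assumed object names):
# Final predictions with the conservative 0.95 threshold:
# a patient is declared healthy only with at least 95% confidence
final_pred <- factor(ifelse(knn_pred_prob[, "notckd"] >= 0.95, "notckd", "ckd"),
                     levels = levels(y_test))
confusionMatrix(final_pred, y_test)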
As initially expected, the quality of life of the patients is improved, but we obtain a worse model in terms of accuracy. However, given the importance of correctly diagnosing a patient in order to improve their life, there is no doubt that this model, even though it makes more mistakes, is much more beneficial. As a final remark, sometimes we must be really conservative when setting the threshold in order to avoid false negatives (especially in medical issues).
Initial model.